ModelChain: Decentralized Privacy-Preserving Healthcare Predictive Modeling Framework on Private Blockchain Networks

Authors

  • Tsung-Ting Kuo
  • Lucila Ohno-Machado
Abstract

Cross-institutional healthcare predictive modeling can accelerate research and facilitate quality improvement initiatives, and thus is important for national healthcare delivery priorities. For example, a model that predicts risk of re-admission for a particular set of patients will be more generalizable if developed with data from multiple institutions. While privacy-protecting methods to build predictive models exist, most are based on a centralized architecture, which presents security and robustness vulnerabilities such as single-point-of-failure (and single-point-of-breach) and accidental or malicious modification of records. In this article, we describe a new framework, ModelChain, to adapt Blockchain technology for privacy-preserving machine learning. Each participating site contributes to model parameter estimation without revealing any patient health information (i.e., only model data, no observation-level data, are exchanged across institutions). We integrate privacy-preserving online machine learning with a private Blockchain network, apply transaction metadata to disseminate partial models, and design a new proof-of-information algorithm to determine the order of the online learning process. We also discuss the benefits and potential issues of applying Blockchain technology to solve the privacy-preserving healthcare predictive modeling task and to increase interoperability between institutions, to support the Nationwide Interoperability Roadmap and national healthcare delivery priorities such as Patient-Centered Outcomes Research (PCOR).

Introduction

Cross-institution interoperable healthcare predictive modeling can advance research and facilitate quality improvement initiatives, for example, by generating scientific evidence for comparative effectiveness research, accelerating biomedical discoveries, and improving patient care.
For example, a healthcare provider may be able to predict certain outcomes even if her institution has few or no related patient records. A predictive model can be “learned” (i.e., its parameters can be estimated) from data originating from the other institutions. However, improper data disclosure could place sensitive personal health information at risk. To protect the privacy of individuals, several algorithms (such as GLORE, EXPLORER, and VERTIGO) have been proposed to conduct predictive modeling by transferring partially trained machine learning models instead of disseminating individual patient-level data. However, these state-of-the-art distributed privacy-preserving predictive modeling frameworks are centralized (i.e., they require a central server to intermediate the modeling process and aggregate the global model), as shown in Figure 1(a).

Figure 1. (a): Centralized topology. (b): Decentralized topology (Blockchain).

Such a client-server architecture carries the following risks:

● Institutional policies. For example, a site may not want to cede control to a single central server.
● Single-point-of-failure. For example, if the central server is shut down for maintenance, the whole network stops working. Furthermore, if the admin user account of the central server is compromised, the entire network is also at risk of being compromised.
● Participating sites cannot join/leave the network at any time. If any site joins or leaves the network for a short period of time, the analysis process is disrupted and the server must handle the recovery. A new site cannot participate in the network without authentication and reconfiguration on the central server.
● The data being disseminated and the transfer records are mutable. An attacker could change the partial models without being noticed. The transfer records may also be modified so that no audit trail is available to identify such malicious changes to the data.
● The client-server architecture may present consensus/synchronization issues on distributed networks. Specifically, the issue is the combination of two problems: the Byzantine Generals Problem, in which the participating sites need to agree upon the aggregated model under the constraint that each site may fail accidentally or even maliciously, and the Sybil Attack Problem, in which an attacker controls a large fraction of the seemingly independent participants and exerts disproportionate influence on the predictive modeling process.

To address the abovementioned risks, one plausible solution is to adapt the Blockchain technology (in this article, we use “Blockchain” to denote the technology, and “blockchain” to indicate the actual chain of blocks). A Blockchain-based distributed network has the following desirable features that make it suitable to mitigate the risks of centralized privacy-preserving healthcare predictive modeling networks. First, Blockchain is by design a decentralized (i.e., peer-to-peer, non-intermediated) architecture (Figure 1(b)); the verification of transactions is achieved by majority proof-of-work voting. Each institution can keep full control of its own computational resources. Also, there is no risk of single-point-of-failure. Second, each site (including new sites) can join/leave the network freely without imposing overhead on a central server or disrupting the machine learning process. Finally, the proof-of-work blockchain provides an immutable audit trail. That is, changing the data or records is very difficult; the attacker needs to redo the proof-of-work of the target block and all blocks after it, and then surpass all honest sites.
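The audit-trail argument above rests on each block committing to the hash of its predecessor. The following is a minimal sketch of that linkage only, with no proof-of-work; the dictionary layout, the JSON encoding, and the function names are illustrative assumptions rather than anything specified by ModelChain or Bitcoin.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder "previous hash" for the first block


def block_hash(prev_hash: str, payload: str) -> str:
    """Hash a block's payload together with the previous block's hash."""
    record = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link the payloads into a list of blocks, each storing its predecessor's hash."""
    chain = []
    prev = GENESIS_PREV
    for payload in payloads:
        current = block_hash(prev, payload)
        chain.append({"payload": payload, "prev": prev, "hash": current})
        prev = current
    return chain


def first_invalid_block(chain):
    """Return the index of the first block with a broken hash or link, or None."""
    prev = GENESIS_PREV
    for index, block in enumerate(chain):
        link_ok = block["prev"] == prev
        hash_ok = block_hash(block["prev"], block["payload"]) == block["hash"]
        if not (link_ok and hash_ok):
            return index
        prev = block["hash"]
    return None


chain = build_chain(["model-v0", "model-v1", "model-v2"])
assert first_invalid_block(chain) is None

# Changing block 1's payload makes its stored hash wrong.
chain[1]["payload"] = "forged-model"
assert first_invalid_block(chain) == 1

# Recomputing only block 1's hash moves the break to block 2,
# so every later block would have to be redone as well.
chain[1]["hash"] = block_hash(chain[1]["prev"], chain[1]["payload"])
assert first_invalid_block(chain) == 2
```

In a proof-of-work chain, each of those recomputations would additionally require a fresh nonce search, which is what makes the rewrite expensive.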
As shown by Satoshi Nakamoto, the inventor of Blockchain and Bitcoin, given that the probability that an honest node finds the next block is larger than the probability that an attacker finds the next block, the probability that the attacker will ever catch up drops exponentially as the number of blocks by which the attacker lags behind increases. This is why the Blockchain mechanism also solves a relaxed version of the Byzantine Generals Problem and the Sybil Attack Problem, as formally proved by Miller et al.

Although Blockchain provides the abovementioned security and robustness benefits, a reasonable approach to integrate Blockchain with privacy-preserving healthcare predictive modeling algorithms is yet to be devised. In this article, we propose ModelChain, a private-Blockchain-based privacy-preserving healthcare predictive modeling framework, to combine these two important technologies. First, we apply privacy-preserving online machine learning algorithms on blockchains. Intuitively, the incremental nature of online machine learning makes it suitable for peer-to-peer networks like Blockchain. Then, we utilize metadata in the transactions to disseminate the partial models and other meta information (i.e., the flag of the model, which indicates the type of action, the hash of the model, and the error of the model), and thus integrate private blockchains (i.e., the network is available only to participating institutions) with privacy-preserving online machine learning. Finally, we design a new proof-of-information algorithm on top of the original proof-of-work consensus protocol to determine the order of the online machine learning on blockchains, aiming at increasing efficiency and accuracy.
The basic idea of proof-of-information is similar to the concept of Boosting: a site whose data cannot be predicted accurately by the current partial model contains more information to improve the model, and thus that site should be assigned a higher priority to be chosen as the next model-updating site. We start with the best model to prevent error propagation, choose the site with the highest error for the current model to update the model, and repeat the process until a site cannot find any other site with a higher error to update the model. In this case, we consider the model to be the consensus model.

ModelChain can advance the following interoperability needs stated in the Nationwide Interoperability Roadmap of the Office of the National Coordinator for Health Information Technology (ONC):

● “Build upon the existing health IT infrastructure.” ModelChain exploits the existing healthcare data in Clinical Data Research Networks (CDRNs) such as the Patient-centered SCAlable National Network for Effectiveness Research (pSCANNER), a CDRN in the PCORI-launched PCORnet that includes three networks: VA Informatics and Computing Infrastructure (VINCI), University of California Research eXchange (UCReX), and SCANNER. With the support of the Blockchain backbone, ModelChain can leverage all existing patient data storage infrastructures, while improving the healthcare prediction power for every site.
● “Maintain modularity.” Compared to the traditional client-server architecture, ModelChain inherits the peer-to-peer architecture of Blockchain, allowing each site to remain modular while interoperating with other sites. Also, each site has control over how its data are accessed (instead of ceding control to the central server), and can thus adhere to institutional policies.
Moreover, Blockchain provides the native ability to automatically coordinate the joining or leaving of each site, further improving the independence and modularity of the participating institutions.
● “Protect privacy and security in all aspects of interoperability.” ModelChain is designed to provide a secure, robust and privacy-preserving interoperability platform. Specifically, Blockchain increases security by avoiding single-point-of-failure, providing immutable audit trails, and mitigating the Byzantine Generals and the Sybil Attack problems, while preserving privacy by exchanging zero patient data during the predictive modeling process.

The expected benefits of ModelChain can also be linked to the stated objectives of Patient-Centered Outcomes Research (PCOR) defined by the Patient-Centered Outcomes Research Institute (PCORI).

Related Work

Privacy-preserving predictive modeling

Cross-institutional healthcare predictive modeling and machine learning can accelerate research and facilitate quality improvement initiatives. However, improper information exchange of biomedical data can put sensitive personal health information at risk. To protect the privacy of individuals, many algorithms have been proposed to conduct predictive modeling by transferring partially trained machine learning models, instead of disseminating individual patient data. For example, GLORE built logistic regression models with horizontally partitioned data, VERTIGO dealt with vertically partitioned data, and WebDISCO constructed a Cox proportional hazards model on horizontally partitioned data. Among these distributed privacy-preserving machine learning algorithms, EXPLORER and Distributed Autonomous Online Learning are “online” machine learning algorithms whose models can be updated in sequential order (as opposed to the other “batch” algorithms). Such online machine learning algorithms are similar to our proposed ModelChain, which updates models on Blockchain sequentially.
However, all these machine learning algorithms, whether they update the models in a batch or online fashion, rely on a centralized network architecture that may suffer from security risks such as a single-point-of-failure. In contrast, ModelChain is built on top of Blockchain, which is a decentralized architecture and can provide further security/robustness improvements (e.g., immutable audit trails). Another related area covers distributed data-parallel machine learning algorithms, such as Parameter Server or computation models based on MapReduce. Nevertheless, they mainly focus on parallelization to speed up computation, instead of aiming at privacy-preserving data analysis, and thus are different from our method.

Blockchain technology for crypto-currency applications

Blockchain was first proposed as a proof-of-work consensus protocol implementation of a peer-to-peer timestamp server on a decentralized basis in the famous Bitcoin crypto-currency. Specifically, an electronic coin (e.g., Bitcoin) is defined as a chain of transactions. A block contains multiple transactions to be verified, and the blocks are chained (i.e., “blockchain”) using hash functions to achieve the timestamp feature. Then, each site “mines” blocks (to confirm the transactions) by solving a difficult hashing problem (i.e., “proof-of-work”). That is, each block contains an additional counter (i.e., “nonce”) as one of the inputs of the hash function, and the nonce is incremented until the hashed value begins with a specified number of leading zero bits (at which point the work is “proved”).
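The nonce search described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it hashes arbitrary bytes once with SHA-256 and counts leading zero bits, whereas Bitcoin hashes a structured block header twice and compares against a numeric target. The function names and the 8-byte nonce encoding are hypothetical.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def pow_digest(block_data: bytes, nonce: int) -> bytes:
    """Hash the block data together with an 8-byte big-endian nonce."""
    return hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()


def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Increment the nonce until the digest has at least difficulty_bits leading zero bits."""
    nonce = 0
    while leading_zero_bits(pow_digest(block_data, nonce)) < difficulty_bits:
        nonce += 1
    return nonce


def verify(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Check a claimed nonce with a single hash evaluation."""
    return leading_zero_bits(pow_digest(block_data, nonce)) >= difficulty_bits


if __name__ == "__main__":
    data = b"previous-hash|transactions"
    nonce = mine(data, difficulty_bits=12)  # about 2**12 attempts on average
    print(nonce, verify(data, nonce, 12))
```

Finding the nonce takes many hash evaluations on average, while checking it takes one; that asymmetry is what lets every site cheaply verify another site's work.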
The first site that successfully satisfies the proof-of-work (and thus has the “decision power”) verifies the transactions and adds the confirmed block to the end of the blockchain. The block is then considered “immutable”: if an attacker wants to change a block, all the blocks after it would also need to be recomputed (because each block is computed using the hash of the previous block in the chain). Under the assumption that honest sites hold more computational power than malicious sites, the probability that the attacker can recompute and modify a block is extremely small (especially when the attacker has already lagged behind by many blocks). Such a proof-of-work design can also be regarded as majority voting (i.e., one-CPU-one-vote); the longest chain (invested with the heaviest proof-of-work effort) represents the majority decision, and thus no trusted central authority (i.e., “mint”) is required to prevent the double-spending problem (i.e., the transactions are validated by the longest chain, which is backed by the majority of the sites). Several recent studies provide detailed analyses of the Blockchain consensus protocol in terms of its ability to resist attacks.

After Bitcoin, several alternatives have also been proposed (alternative blockchains, or “altchains”), such as Colored coins (a protocol to support Bitcoin in different “colors” as different crypto-currencies) and Sidechains (a protocol to allow Bitcoin to be transferred between multiple blockchain networks).
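The earlier remark that an attacker who lags behind by many blocks is very unlikely to rewrite history can be made concrete. The sketch below uses the closed form from Nakamoto's Bitcoin paper: with p the probability that an honest node finds the next block and q the probability that the attacker does, the chance of ever catching up from z blocks behind is (q/p)^z when q < p, and 1 otherwise. The function name is illustrative, and p = 1 - q is assumed.

```python
def catch_up_probability(q: float, z: int) -> float:
    """Probability that an attacker with block-finding probability q
    ever catches up from z blocks behind (assumes p = 1 - q)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    if z < 0:
        raise ValueError("z must be non-negative")
    p = 1.0 - q
    if q >= p:
        return 1.0  # an attacker with at least half the power always catches up
    return (q / p) ** z


if __name__ == "__main__":
    # Print how quickly the probability falls as the deficit grows.
    for deficit in (0, 1, 2, 5, 10):
        print(deficit, catch_up_probability(0.25, deficit))
```

With q = 0.25 the ratio q/p is 1/3, so each additional block of deficit divides the attacker's chance by three.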
Also, several protocols have been proposed on top of Bitcoin’s proof-of-work to increase the difficulty of developing a “Bitcoin monopoly”, such as proof-of-stake (in which the “decision power” is based on the ages of the owned bitcoins; the site with the largest “stake” can confirm and add the new block to the blockchain) and proof-of-burn (in which the “decision power” is based on the destruction of owned bitcoins; the site that is willing to destroy the largest number of its bitcoins can confirm and add the new block to the blockchain). In this article, we propose a proof-of-information algorithm on top of proof-of-work, to provide “decision power” (i.e., the privilege to update the online machine learning model) to the site with the highest expected amount of information.

Blockchain technology for non-financial and healthcare applications

Blockchain was created for financial transactions, but it is also a new form of distributed database, because it can store arbitrary data in the transaction metadata (the metadata has been an official Bitcoin entity since 2014). The original Bitcoin only supports 80 bytes of metadata (via OP_RETURN), but several implementations of Blockchain support a larger metadata size. For example, MultiChain supports an adjustable maximum metadata size per transaction. Another example is BigchainDB, which is built on top of the big data database RethinkDB and thus has no hard limit on the transaction size. Here, we utilize the transaction metadata to disseminate the partially trained online machine learning model (and the meta information of the model) among participating sites. Such a Blockchain-based distributed database is also known as Blockchain 2.0, which includes technologies such as smart properties (properties with blockchain-controlled ownership) and smart contracts (computer programs that manage smart properties). One of the most famous Blockchain 2.0 systems is Ethereum, a decentralized platform that runs smart contracts.
Ethereum has a built-in Turing-complete programming language that supports loop computation, which is not provided by the Bitcoin scripting language. In the context of a distributed database, smart properties are data entries, and smart contracts are stored procedures. Our proof-of-information algorithm may be implemented using Blockchain 2.0 technologies as well, with smart properties being partial models, and smart contracts being the algorithms to update and transfer the partial models.

Recently, the concept of Blockchain 3.0 has been proposed to indicate applications beyond currency, economy, and markets. One of the most important applications is the adaptation of Blockchain technology to the healthcare system. For example, Irving et al. evaluated the idea of using the blockchain as a distributed tamper-proof public ledger to provide proof of pre-specified endpoints in clinical trials; McKernan proposed to apply a decentralized blockchain to store genomic data; and Jenkins et al. discussed a bio-mining framework for biomarkers with a multi-resolution blockchain to perform multi-factor authentication and thus increase data security. There are also studies that propose to use Blockchain to store electronic health records, or to record health transactions. However, to the best of our knowledge, we are the first to propose the adoption of Blockchain to improve the security and robustness of privacy-preserving healthcare predictive modeling.

The ModelChain Framework

In ModelChain, we apply privacy-preserving online machine learning algorithms on blockchains. Intuitively, the incremental nature of online machine learning makes it suitable for peer-to-peer networks like Blockchain. It should be noted that any online learning algorithm, such as EXPLORER or Distributed Autonomous Online Learning, can be adapted in our framework.

Figure 2. An example of ModelChain. Each block represents a timestamp, containing only one transaction.
Each transaction contains a model, flag (action type) of the model, hash of the model, and error of the model.

Next, we utilize the metadata in the transactions to disseminate the partial models and the meta information (i.e., flag of the model, hash of the model, and error of the model) to integrate privacy-preserving online machine learning with a private Blockchain network (Figure 2). There are four types of flag in ModelChain: INITIALIZE, UPDATE, EVALUATE, and TRANSFER, which indicate the action a site has taken on a model (e.g., INITIALIZE = the site initialized the model). We include the hash of the model to save storage space (i.e., only UPDATE transactions include both the model and the hash of the model; all other types of transactions include only the hash of the model, with model = NULL, to reduce the size of the blockchain). In each transaction, both the amount and the transaction fee are set to zero. Also, in this private Blockchain network, no block mining reward is provided. The incentive for each site to mine blocks and verify transactions is the improved accuracy of the predictive model using cross-institution data in a privacy-preserving manner. In addition, each block contains only one transaction (so each transaction has a unique timestamp). The private blockchain containing all blocks of transactions can be regarded as a distributed database (or data ledger) that every site can read and write to. We then use this Blockchain-based private distributed database as the basis of the proof-of-information algorithm.

Finally, we designed a new proof-of-information algorithm on top of the original proof-of-work consensus protocol to determine the order of the online machine learning on blockchains, aiming at increasing efficiency and accuracy.
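The transaction layout described above (flag, hash, error, and an optional model, with amount and fee fixed at zero) can be sketched as a small record type. This is an illustration only: the class and function names, the representation of a model as a list of weights, and the SHA-256-over-JSON hashing scheme are assumptions, not part of the ModelChain description.

```python
import hashlib
import json
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Flag(Enum):
    """The four action types named in the text."""
    INITIALIZE = "INITIALIZE"
    UPDATE = "UPDATE"
    EVALUATE = "EVALUATE"
    TRANSFER = "TRANSFER"


def hash_model(weights: List[float]) -> str:
    """Assumed hashing scheme: SHA-256 over the JSON-encoded weights."""
    return hashlib.sha256(json.dumps(weights).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelTransaction:
    flag: Flag
    model_hash: str
    error: float
    model: Optional[List[float]] = None  # NULL unless the flag is UPDATE
    amount: int = 0                      # always zero in ModelChain
    fee: int = 0                         # always zero in ModelChain


def make_transaction(flag: Flag, weights: List[float], error: float) -> ModelTransaction:
    """Carry the full model only on UPDATE; otherwise carry just its hash."""
    carried = list(weights) if flag is Flag.UPDATE else None
    return ModelTransaction(flag=flag, model_hash=hash_model(weights),
                            error=error, model=carried)


update_tx = make_transaction(Flag.UPDATE, [0.1, -0.2], error=0.3)
evaluate_tx = make_transaction(Flag.EVALUATE, [0.1, -0.2], error=0.7)
assert update_tx.model == [0.1, -0.2]
assert evaluate_tx.model is None
assert update_tx.model_hash == evaluate_tx.model_hash
```

Because the non-UPDATE transactions carry only the hash, a site can match an evaluation or transfer record to the model it refers to without the model being stored again.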
The basic idea is similar to the concept of Boosting: a site whose data cannot be predicted accurately by the current partial model probably contains more information to improve the model than other sites, and thus that site should be assigned a higher priority to be chosen as the next model-updating site.

A running example of the proof-of-information algorithm is shown in Figure 3. Suppose there are four participating sites that would like to train a privacy-preserving online machine learning model on the private Blockchain network. Let M_{t,s} denote the model at time t on site s, and E_{t,s} the error at time t on site s. In the initialization stage (t = 0), each site trains its own model using its local patient data, and the model with the lowest error (Site 1 with E_{0,1} = 0.2 in our example) is selected as the initial model. The reason to choose the best model is to prevent the propagation of error. Conceptually, we regard M_{0,1} as being “transferred” from Site 1 to Site 1 itself. Then, the selected model (M_{0,1}) is submitted to Sites 2, 3, and 4.

Next (t = 1), each site evaluates the model M_{1,1} (which is the same as M_{0,1}) using its local data. Suppose Site 2 has the highest error (E_{1,2} = 0.7). Given that the data in Site 2 are the most unpredictable for model M_{1,1}, we assume that Site 2 contains the richest information to improve M_{1,1}. Therefore, Site 2 wins the “information bid”, and the model M_{1,1} is now transferred to Site 2 within the block B1 (with amount = 0 and transaction fee = 0) shown in Figure 2. It should be noted that the Blockchain protocol requires every site to submit every transaction to every other site for verification. Therefore, M_{1,1} is actually submitted from Site 1 to every site. However, since Site 2 wins the “information bid”, we conceptually regard M_{1,1} as being “transferred” from Site 1 to Site 2, in the sense that only Site 2 can update M_{1,1} using the local patient data in Site 2.
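The selection rule walked through above can be sketched as a short loop. This is a toy illustration under loud assumptions: the "model" is a single number, the error is mean absolute deviation, and the update is a half-step toward the local mean. None of these stand in for EXPLORER or any other real online learner, and the class, method, and parameter names (including max_rounds) are hypothetical. Only the ordering logic follows the text: start from the local model with the lowest error, hand the model to the other site with the highest error, and stop when no other site has a higher error than the current holder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Site:
    name: str
    data: List[float]

    def train(self) -> float:
        """Toy local model: the mean of the local data."""
        return sum(self.data) / len(self.data)

    def error(self, model: float) -> float:
        """Toy error: mean absolute deviation of local data from the model."""
        return sum(abs(x - model) for x in self.data) / len(self.data)

    def update(self, model: float, rate: float = 0.5) -> float:
        """Toy online update: move the model part-way toward the local mean."""
        return model + rate * (self.train() - model)


def proof_of_information(sites: List[Site], max_rounds: int = 20) -> Tuple[float, List[str]]:
    """Return the consensus model and the order in which sites held it."""
    # Initialization: the local model with the lowest local error is selected.
    holder = min(sites, key=lambda s: s.error(s.train()))
    model = holder.train()
    order = [holder.name]
    for _ in range(max_rounds):
        others = [s for s in sites if s is not holder]
        if not others:
            break
        # Every other site evaluates the current model; the highest error wins the bid.
        winner = max(others, key=lambda s: s.error(model))
        if winner.error(model) <= holder.error(model):
            break  # no other site has a higher error: treat the model as consensus
        model = winner.update(model)
        holder = winner
        order.append(holder.name)
    return model, order


sites = [Site("A", [1.0, 1.0]), Site("B", [3.0, 5.0]), Site("C", [2.0, 2.0])]
model, order = proof_of_information(sites)
print(model, order)  # 2.5 ['A', 'B']
```

In this run, site A starts (its local error is zero), site B wins the first bid with error 3.0 and moves the model from 1.0 to 2.5, and the loop stops because the best remaining bid (A, with error 1.5) does not exceed B's own error of 1.5.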


Similar resources

FairAccess: a new Blockchain-based access control framework for the Internet of Things

Security and privacy are huge challenges in Internet of Things (IoT) environments, but unfortunately, the harmonization of the IoT-related standards and protocols is hardly and slowly widespread. In this paper, we propose a new framework for access control in IoT based on the blockchain technology. Our first contribution consists in providing a reference model for our proposed framework within ...


A centralized privacy-preserving framework for online social networks

There are some critical privacy concerns in the current online social networks (OSNs). Users' information is disclosed to different entities that they were not supposed to access. Furthermore, the notion of friendship is inadequate in OSNs since the degree of social relationships between users dynamically changes over the time. Additionally, users may define similar privacy settings for their f...


RoboChain: A Secure Data-Sharing Framework for Human-Robot Interaction

Robots have potential to revolutionize the way we interact with the world around us. One of their largest potentials is in the domain of mobile health where they can be used to facilitate clinical interventions. However, to accomplish this, robots need to have access to our private data in order to learn from these data and improve their interaction capabilities. Furthermore, to enhance this le...


[Proceeding] Blockchain for the Internet of Things: a Systematic Literature Review

In the Internet of Things (IoT) scenario, the blockchain and, in general, Peer-to-Peer approaches could play an important role in the development of decentralized and data-intensive applications running on billion of devices, preserving the privacy of the users. Our research goal is to understand whether the blockchain and Peer-to-Peer approaches can be employed to foster a decentralized and pri...


Secure and Trustable Electronic Medical Records Sharing using Blockchain

Electronic medical records (EMRs) are critical, highly sensitive private information in healthcare, and need to be frequently shared among peers. Blockchain provides a shared, immutable and transparent history of all the transactions to build applications with trust, accountability and transparency. This provides a unique opportunity to develop a secure and trustable EMR data management and sha...



Journal title:
  • CoRR

Volume abs/1802.01746  Issue 

Pages  -

Publication date 2016